Abstract



We propose geometric weighting as a novel method to combine multiple models 
in data compression. Our results reveal the rationale behind PAQ-weighting and 
generalize it to a non-binary alphabet. Based on a similar technique we present a new, 
generic linear mixture technique. All novel mixture techniques rely on given weight 
vectors. We consider the problem of finding optimal weights and show that the 
weight optimization leads to a strictly convex (and thus, good-natured) optimization 
problem. Finally, an experimental evaluation compares the two presented mixture 
techniques for a binary alphabet. The results indicate that geometric weighting 
is superior to linear weighting. 

1 Introduction 
1.1 Background 

The combination of multiple models is a central aspect of many modern data compression
algorithms, such as Prediction by Partial Matching (PPM) [2, El E], Context Tree
Weighting (CTW) flDlIII] or "Pack" (PAQ). All of these algorithms belong to the
class of statistical data compression algorithms, which share a common structure: The
compressor consists of a model and a coder, and it processes the data (a string
$x^n \in \mathcal{X}^n$ for some alphabet $\mathcal{X}$, $|\mathcal{X}| \ge 2$) sequentially. In the $k$-th step,
$1 \le k \le n$, the model estimates the probability distribution $P(\cdot \mid x^{k-1})$ of the next
symbol based on the already processed sequence $x^{k-1} = x_1 x_2 \dots x_{k-1}$. The task of the
coder is to map a symbol $x \in \mathcal{X}$ to a codeword of a length close to $-\log P(x \mid x^{k-1})$
bits (throughout this paper log is to the base two). For decompression the coder maps the
encoding, given $P(\cdot \mid x^{k-1})$, to $x$. Arithmetic Coding (AC) closely approximates the ideal
code length and is known to be asymptotically optimal [3]. Therefore, the prediction
accuracy of the model is crucial for compression.

Mixture models or mixtures combine multiple models into a single model suitable for 
encoding. Let us now consider a simple example, which gives two reasons for our interest 
in mixtures. First, assume that we have $m > 1$ models available. Model $i$, $1 \le i \le m$,
maps an arbitrary $x^k$ to a prediction $P_i(x^k)$ (a probability distribution), where



and $P_i(x^k) > 0$, $1 \le i \le m$, $k \ge 0$. When we compress $x^n$ with a single model $i$, we
need to encode the choice of $i$ in $-\log W(i)$ bits (where $W(i)$ is the prior probability of
selecting model $i$) and we need to store the encoded string, which adds $-\log P_i(x^n)$ bits.
If we knew $x^n$ in advance, we could select

\[ \hat{\imath} = \arg\min_{1 \le j \le m} \left[ -\log W(j) - \log P_j(x^n) \right]. \tag{2} \]

Surprisingly (as previously observed in, e.g., [7]), a simple linear mixture
$P(x^n) := \sum_{i=1}^{m} W(i) P_i(x^n)$ will never do worse than (2), since

\[ -\log W(\hat{\imath}) - \log P_{\hat{\imath}}(x^n) = -\log\left( W(\hat{\imath}) P_{\hat{\imath}}(x^n) \right) \ge -\log \sum_{i=1}^{m} W(i) P_i(x^n), \]

where $\hat{\imath}$ is the model that minimizes (2). Such a mixture makes it possible to combine
the advantages of different models without accumulating their disadvantages. Secondly,
the sequential processing allows us to refine the mixture adaptively (in favor of the locally
more accurate models).
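
The inequality between the best single model and the linear mixture can be checked numerically. The following is a minimal sketch; the prior `prior` and the sequence probabilities `seq_probs` are hypothetical numbers chosen only for illustration, and the function names are not taken from the paper.

```python
import math

def best_single_model_bits(prior, seq_probs):
    # Cost of committing to one model j: -log2 W(j) - log2 P_j(x^n); take the cheapest j.
    return min(-math.log2(w) - math.log2(p) for w, p in zip(prior, seq_probs))

def linear_mixture_bits(prior, seq_probs):
    # Cost of coding with the mixture P(x^n) = sum_i W(i) P_i(x^n).
    return -math.log2(sum(w * p for w, p in zip(prior, seq_probs)))

# Hypothetical values, for illustration only.
prior = [0.5, 0.3, 0.2]          # W(i)
seq_probs = [1e-6, 4e-5, 2e-7]   # P_i(x^n)

best_bits = best_single_model_bits(prior, seq_probs)
mixture_bits = linear_mixture_bits(prior, seq_probs)
print(f"best single model: {best_bits:.3f} bits, mixture: {mixture_bits:.3f} bits")
```

The mixture cost can only be smaller or equal, because the sum inside the logarithm contains the term of the best model plus further non-negative terms.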

1.2 Previous Work 

Most of the major statistical compression techniques (PPM, CTW and PAQ) are based on
mixtures. In PPM the concept of "escape" symbols is related to the computation of a
recursively defined mixture distribution. The escape probability plays the role of a weight in a
linear mixture. In [2] Bunton gave a synopsis on that topic, which was very comprehensive at
that time. Previously, several different methods for the estimation of escape probabilities had
been proposed, e.g., PPMA, PPMB, PPMC, PPMD, PPMP, PPMX P, PPMII ^. CTW relies
on the efficient combination of exponentially many (depending on a "tree depth" parameter)
models for tree sources. However, the structure of PPM and CTW restricts the type of
models they combine (order-N models for PPM and models for tree sources for CTW). Recently,
some of the techniques of CTW led to β-weighting [1], a linear general-purpose weighting
method. We are interested in general-purpose mixture techniques, which combine arbitrary
(and possibly totally different) models. The practical success of this approach was
initiated by Mahoney with PAQ (see |H] for details). PAQ combines a large number of totally
different models (e.g., models for text, for images, etc.). As a minor part of earlier work we
successfully employed a simple linear mixture model for encoding Burrows-Wheeler Transform
(BWT) output and proposed a method for the parameter optimization on training data [S].

1.3 Our Contribution 

In Section 3 we propose geometric weighting as a novel non-linear mixture technique. We
obtain the geometric mixture as the solution of a divergence minimization problem. In
addition we show that PAQ-mixing is a special case of geometric weighting for a binary
alphabet. Since geometric weighting depends on a set of weights, we examine the problem
of weight optimization and propose a corresponding optimization method. In Section 4
we focus on linear mixtures. In a fashion analogous to Section 3 we describe a new generic
linear mixture and investigate the problem of weight optimization. Finally, we compare
the behavior of the implementations (for a binary alphabet) of the two proposed mixture
techniques and of β-weighting in Section 5. Results indicate that geometric weighting
is superior to the other mixture methods.



2 Preliminaries 



First, we fix some notation. Let $\mathcal{X}$ denote an alphabet of cardinality $1 < |\mathcal{X}| < \infty$ and
let $x_i^j = x_i x_{i+1} \dots x_j$ be a sequence of length $n = j - i + 1$ over $\mathcal{X}$. For short we may
write $x^n$ for $x_1^n$. Abbreviations such as $(a_i)_{1 \le i \le n}$ expand to $(a_1\ a_2\ \dots\ a_n)$ and denote row
vectors. Boldface letters indicate matrices or vectors, $^T$ denotes the transpose operator,
$\mathbf{1}_m := (1\ 1\ \dots\ 1)^T \in \mathbb{R}^m$ and $\Omega_m := \{\mathbf{v} \in \mathbb{R}^m \mid \mathbf{v} \ge 0,\ \mathbf{v}^T \mathbf{1}_m = 1\}$. We use log to
denote the logarithm with base two, ln denotes the natural logarithm.

Suppose that we want to compress a string $x^n \in \mathcal{X}^n$ sequentially. In every step $1 \le k \le n$
a model $M : \bigcup_{k \ge 0} \mathcal{X}^k \to \mathcal{P}$ maps the already known prefix $x^{k-1}$ of $x^n$ to a model distribution
$P(\cdot \mid x^{k-1})$, $P \in \mathcal{P}$, where $\mathcal{P} := \{Q : \mathcal{X} \to (0, 1) \mid \sum_{x \in \mathcal{X}} Q(x) = 1\}$. An encoder
translates this into a code of length close to $-\log P(x \mid x^{k-1})$ bits for $x$. Now, if there are $m > 1$
submodels $M_1, M_2, \dots, M_m$ (or submodels $1, 2, \dots, m$, for short), we require a mixture
function $f_k : \mathcal{X} \times \mathcal{P}^m \to (0, 1)$ to map the $m$ corresponding distributions $P_1, P_2, \dots, P_m$ to a
single distribution $P(x) = f_k(x, P_1, P_2, \dots, P_m)$, $P \in \mathcal{P}$, in step $k$; $f_k$ may depend on $x^{k-1}$.

An approach in information theory is to suppose that $x^n$ was generated by an unknown
mechanism, which is called a source. W.l.o.g. we may assume that $x^n$ was generated
sequentially: In every step $k$ the source draws $x$ according to an arbitrary source
distribution $P' \in \mathcal{S} := \{Q : \mathcal{X} \to [0, 1] \mid \sum_{x \in \mathcal{X}} Q(x) = 1\}$ (the distribution $P'$ may vary
from step to step) and appends it to $x^{k-1}$ to yield $x^k = x^{k-1}x$. When we encode $x$, using
a model distribution $P \in \mathcal{P}$, we obtain an expected code length of

\[ \sum_{x \in \mathcal{X}} P'(x) \log \frac{1}{P(x)} = \underbrace{\sum_{x \in \mathcal{X}} P'(x) \log \frac{1}{P'(x)}}_{H(P')} + \underbrace{\sum_{x \in \mathcal{X}} P'(x) \log \frac{P'(x)}{P(x)}}_{D(P' \,\|\, P)}, \]

where $H(P')$ is the source entropy and $D(P' \,\|\, P)$ is the KL-divergence [3], which measures
the redundancy of $P$ relative to $P'$. Our aim is to find a $P$ that minimizes the code
length. Since $H(P')$ is fixed (by the source), we want to minimize $D(P' \,\|\, P)$. We have
$D(P' \,\|\, P) \ge 0$, which is zero iff $P = P'$, i.e., the best model distribution is the source
distribution itself.
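
The decomposition of the expected code length into entropy plus divergence can be illustrated with a small numeric check. This is only a sketch: the distributions `source` and `model` are hypothetical, and the helper names are chosen for illustration.

```python
import math

def entropy(p):
    # H(P') = sum_x P'(x) log2(1 / P'(x)), in bits.
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    # D(P' || P) = sum_x P'(x) log2(P'(x) / P(x)), in bits.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def expected_code_length(p_source, p_model):
    # Expected number of bits when symbols drawn from p_source are coded with p_model.
    return sum(ps * -math.log2(pm) for ps, pm in zip(p_source, p_model) if ps > 0)

# Hypothetical distributions over a three-letter alphabet.
source = [0.5, 0.25, 0.25]
model = [0.4, 0.4, 0.2]

total = expected_code_length(source, model)
split = entropy(source) + kl_divergence(source, model)
print(f"expected length {total:.4f} bits = H + D = {split:.4f} bits")
```

Coding with the source distribution itself makes the divergence term vanish, so only the entropy remains.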

3 Geometric Mixtures 

This section contains the major part of our work: We derive geometric weighting as a novel
method for combining multiple models. Now suppose that we have $m$ model distributions
$P_1, P_2, \dots, P_m$ available in step $k$. Since the source distribution $P'$ is unknown (if it exists
at all) we try to identify an approximate source distribution $P \in \mathcal{S} \cap \mathcal{P}$, which we can use
as a model distribution. It should be "close" (in the divergence-sense) to good models and
"far away" from bad models. The terms good and bad refer to short and long code lengths
(due to past observations and/or prior knowledge). We assume that we are given a set of
non-negative weights $w_i$, $1 \le i \le m$, $\sum_{i=1}^{m} w_i > 0$ (in Section 3.2 we discuss a method of
weight estimation), which quantify how well model $i$ fits the unknown source distribution.
Summarizing, we are looking for the distribution

\[ P := \arg\min_{Q} \sum_{i=1}^{m} w_i D(Q \,\|\, P_i). \tag{3} \]



3.1 Divergence Minimization 

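In order to solve (3) we adopt the method of Lagrangian multipliers. First, we set
$Q(x \mid x^{k-1}) = \theta_x$ and $\boldsymbol{\theta}^T = (\theta_x)_{x \in \mathcal{X}}$ to omit the implicit dependence on $k$ and to simplify
the equations. Now we rewrite (3) to yield

\[ \min_{\boldsymbol{\theta}} \sum_{i=1}^{m} w_i \sum_{x \in \mathcal{X}} \theta_x \left[ \log(\theta_x) - \log(P_i(x \mid x^{k-1})) \right] \quad \text{s.t.} \quad \sum_{x \in \mathcal{X}} \theta_x = 1 \ \text{and} \ \theta_x \ge 0,\ x \in \mathcal{X}, \tag{4} \]

and formulate its Lagrangian

\[ L(\boldsymbol{\theta}, \lambda, \boldsymbol{\mu}) = \sum_{i=1}^{m} w_i \sum_{x \in \mathcal{X}} \theta_x \left[ \log(\theta_x) - \log(P_i(x \mid x^{k-1})) \right] - \lambda \left( 1 - \sum_{x \in \mathcal{X}} \theta_x \right) - \sum_{x \in \mathcal{X}} \mu_x \theta_x. \]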



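The variable $\lambda$ and the vector $\boldsymbol{\mu}^T = (\mu_x)_{x \in \mathcal{X}}$ denote the Lagrange multipliers. A local
minimum $\boldsymbol{\theta}^*, \lambda^*, \boldsymbol{\mu}^*$ satisfies the Karush-Kuhn-Tucker (KKT) conditions (see, e.g., [1])

\[ \frac{\partial L(\boldsymbol{\theta}^*, \lambda^*, \boldsymbol{\mu}^*)}{\partial \theta_x} = 0, \tag{5} \]
\[ \theta_x^* \ge 0, \quad \mu_x^* \ge 0, \quad \theta_x^* \mu_x^* = 0 \tag{6} \]

for all $x \in \mathcal{X}$ and

\[ \sum_{x \in \mathcal{X}} \theta_x^* = 1. \tag{7} \]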



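Due to (6) we obtain $\mu_x^* = 0$ for all $x \in \mathcal{X}$. Equation (5) can be transformed to

\[ \frac{1}{\ln 2} \sum_{i=1}^{m} w_i + \lambda^* + \log\left( (\theta_x^*)^{\sum_{i=1}^{m} w_i} \right) = \log \prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i}. \tag{8} \]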



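Now we fix a pair $x \ne x'$ of distinct symbols from $\mathcal{X}$ and subtract the corresponding
instances of (8), which results in

\[ \theta_{x'}^* = \theta_x^* \prod_{i=1}^{m} \left[ \frac{P_i(x' \mid x^{k-1})}{P_i(x \mid x^{k-1})} \right]^{w_i'}, \quad \text{where } w_i' := \frac{w_i}{\sum_{j=1}^{m} w_j}. \tag{9} \]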



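Again, we fix a single character $x$ and substitute any other occurrence of $\theta_{x'}^*$, $x' \ne x$, in (7)
via (9). Thus we have

\[ 1 = \theta_x^* + \theta_x^* \sum_{x' \in \mathcal{X} \setminus \{x\}} \prod_{i=1}^{m} \left[ \frac{P_i(x' \mid x^{k-1})}{P_i(x \mid x^{k-1})} \right]^{w_i'}, \]

which we rewrite to yield

\[ \theta_x^* = \frac{\prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i'}}{\sum_{x' \in \mathcal{X}} \prod_{i=1}^{m} P_i(x' \mid x^{k-1})^{w_i'}}. \tag{10} \]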



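Finally, we reintroduce the dependencies on $k$ and obtain the geometric mixture

\[ P(x \mid x^{k-1}) = f_k(x, P_1, P_2, \dots, P_m) := \frac{\prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}{\sum_{x' \in \mathcal{X}} \prod_{i=1}^{m} P_i(x' \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}, \tag{11} \]

where $\mathbf{w}^T = (w_i)_{1 \le i \le m}$ is composed of the non-negative weights $w_i$. It remains to show
that (10) minimizes (4). For this, we observe that the Hessian of the cost function of (4) is
(up to the positive constant factor $1/\ln 2$ that stems from the base-two logarithm)

\[ \mathbf{w}^T \mathbf{1}_m \cdot \operatorname{diag}\left( (1/\theta_x)_{x \in \mathcal{X}} \right), \]

which is positive definite, since $\theta_x > 0$ for all $x \in \mathcal{X}$.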
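
The geometric mixture (11) and the objective of (3) can be sketched in a few lines. The code below is an illustration under assumptions, not the implementation used in the experiments: the function names, the log-domain evaluation and the example distributions are all chosen here for demonstration.

```python
import math

def geometric_mixture(dists, weights):
    # dists[i][x] = P_i(x | x^{k-1}) > 0; weights are non-negative with a positive sum.
    total = sum(weights)
    n_symbols = len(dists[0])
    # Work in the log domain: sum_i (w_i / w^T 1) * ln P_i(x) for every symbol x.
    log_scores = [
        sum(w / total * math.log(d[x]) for w, d in zip(weights, dists))
        for x in range(n_symbols)
    ]
    peak = max(log_scores)                      # subtract the maximum for numerical stability
    scores = [math.exp(s - peak) for s in log_scores]
    norm = sum(scores)
    return [s / norm for s in scores]

def weighted_divergence(q, dists, weights):
    # Objective of (3): sum_i w_i D(Q || P_i), measured in bits.
    return sum(
        w * sum(qx * math.log2(qx / d[x]) for x, qx in enumerate(q) if qx > 0)
        for w, d in zip(weights, dists)
    )

# Hypothetical example: two models over a binary alphabet with equal weights.
dists = [[0.9, 0.1], [0.2, 0.8]]
weights = [1.0, 1.0]
mix = geometric_mixture(dists, weights)
print(mix)
```

For this example the mixture evaluates to (0.6, 0.4) up to rounding, and its objective value is lower than that of the arithmetic average (0.55, 0.45) of the two models.
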
3.2 Weight Estimation and Convexity 



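The mixture function (11) requires $m$ non-negative weights $\mathbf{w}$, which we
still need to obtain. In our situation the sequence is known (and fixed) and the sequence
probability is given as a function of $\mathbf{w}$ as

\[ \prod_{k=1}^{n} f_k(x_k, P_1, P_2, \dots, P_m) = \prod_{k=1}^{n} \frac{\prod_{i=1}^{m} P_i(x_k \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}{\sum_{x \in \mathcal{X}} \prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}. \tag{12} \]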



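We now wish to find a weight vector $\mathbf{w}$, which maximizes (12) (a maximum-likelihood
estimation). Since a maximization of the sequence probability is equivalent to a minimization
of its code length, we may alternatively solve

\[ \min_{\mathbf{w}} \sum_{k=1}^{n} -\log \frac{\prod_{i=1}^{m} P_i(x_k \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}{\sum_{x \in \mathcal{X}} \prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i / (\mathbf{w}^T \mathbf{1}_m)}}. \tag{13} \]

We define $\mathbf{w}^*$ to be the minimizer of (13).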



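Now we want to show that the cost function of (13) is convex. Since the cost function is
a sum, we analyze a slight modification of a single term $l(\mathbf{w}) := -\ln(g(\mathbf{w})/h(\mathbf{w}))$ (since
$\log(x) \propto \ln(x)$). W.l.o.g. we may assume that $\mathbf{w} \in \Omega_m$ (due to (9)). In order to simplify
the analysis of the Hessian of $l(\mathbf{w})$ we set

\[ g(\mathbf{w}) := \prod_{i=1}^{m} P_i(x_k \mid x^{k-1})^{w_i}, \qquad h(\mathbf{w}) := \sum_{x \in \mathcal{X}} \prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i}, \]
\[ \mathbf{Q}(x)^T := \left( \ln P_i(x \mid x^{k-1}) \right)_{1 \le i \le m}, \qquad p_x := \frac{\prod_{i=1}^{m} P_i(x \mid x^{k-1})^{w_i}}{\sum_{x' \in \mathcal{X}} \prod_{i=1}^{m} P_i(x' \mid x^{k-1})^{w_i}} \]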



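and we obtain

\[ \frac{\nabla g(\mathbf{w})}{g(\mathbf{w})} = \mathbf{Q}(x_k), \qquad \frac{\nabla h(\mathbf{w})}{h(\mathbf{w})} = \sum_{x \in \mathcal{X}} p_x \mathbf{Q}(x), \]
\[ \frac{\nabla^2 g(\mathbf{w})}{g(\mathbf{w})} = \mathbf{Q}(x_k) \mathbf{Q}(x_k)^T, \qquad \frac{\nabla^2 h(\mathbf{w})}{h(\mathbf{w})} = \sum_{x \in \mathcal{X}} p_x \mathbf{Q}(x) \mathbf{Q}(x)^T. \]

The Hessian of $l(\mathbf{w})$ is positive definite, since for $\mathbf{v} \ne 0$, $\mathbf{v} \in \mathbb{R}^m$,

\[ \begin{aligned} \mathbf{v}^T \nabla^2 l(\mathbf{w}) \mathbf{v} &= \mathbf{v}^T \left[ \frac{\nabla g(\mathbf{w}) \nabla g(\mathbf{w})^T}{g(\mathbf{w})^2} - \frac{\nabla^2 g(\mathbf{w})}{g(\mathbf{w})} + \frac{\nabla^2 h(\mathbf{w})}{h(\mathbf{w})} - \frac{\nabla h(\mathbf{w}) \nabla h(\mathbf{w})^T}{h(\mathbf{w})^2} \right] \mathbf{v} \\ &= \sum_{x \in \mathcal{X}} p_x \left( \mathbf{v}^T \mathbf{Q}(x) \right)^2 - \left( \sum_{x \in \mathcal{X}} p_x \mathbf{v}^T \mathbf{Q}(x) \right)^2 \\ &> 0 \end{aligned} \]

holds, where the last line is due to Jensen's inequality (since $\sum_{x \in \mathcal{X}} p_x = 1$). It follows that
the problem (13) is strictly convex and there exists a single global minimizer $\mathbf{w}^* \in \Omega_m$.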
We solve the problem (13) with an optimization method tailored to a natural requirement
in statistical compression: The sequence to be compressed is processed only once. Since
the cost function is convex, the optimization algorithm does not need strong global search
capabilities. A possible method of choice is an instance of iterative gradient descent
[1]. In the $k$-th step we use the estimates $\mathbf{w}(k)$ in place of $\mathbf{w}^*$ (in (11)). Initially we set
$\mathbf{w}(0) = 1/m \cdot \mathbf{1}_m$. In each step $k$ we adjust the weight vector $\mathbf{w}(k-1)$ after we observe
$x_k$ via a step towards the direction of steepest descent, i.e.,

\[ -\alpha_k \nabla_{\mathbf{w}} \left( -\log f_k(x_k, P_1, P_2, \dots, P_m) \right), \tag{19} \]

where $\alpha_k > 0$ is the step size in the $k$-th step. The choice of $\alpha_k$ is crucial for the convergence
of $\mathbf{w}(k)$ to $\mathbf{w}^*$ (see Sections 3.3 and 5). In the case of a geometric mixture function we have

\[ \mathbf{w}(k) := \max\left\{ \varepsilon \mathbf{1}_m,\ \mathbf{w}(k-1) + \alpha_k \frac{\left( \mathbf{Q}(x_k) - q_{x_k} \mathbf{1}_m \right) - \sum_{x \in \mathcal{X}} p_x \left( \mathbf{Q}(x) - q_x \mathbf{1}_m \right)}{\mathbf{w}(k-1)^T \mathbf{1}_m} \right\}, \]

where $q_x := (\mathbf{w}^T \mathbf{Q}(x)) / (\mathbf{w}^T \mathbf{1}_m)$ with $\mathbf{w} = \mathbf{w}(k-1)$, and the maximum is taken componentwise.
As an implementation detail $\varepsilon > 0$ is a small constant
to bound the weights away from zero and to avoid a division by zero in (11).
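
One step of this weight update can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' implementation: the function names and the example numbers are hypothetical, natural logarithms are used throughout (the constant factor between log and ln is assumed to be absorbed into the step size), and the default `eps` is only a placeholder.

```python
import math

def geometric_mixture(dists, w):
    # Geometric mixture with normalized exponents w_i / (w^T 1_m).
    s = sum(w)
    logs = [sum(wi / s * math.log(d[x]) for wi, d in zip(w, dists))
            for x in range(len(dists[0]))]
    peak = max(logs)
    scores = [math.exp(v - peak) for v in logs]
    z = sum(scores)
    return [v / z for v in scores]

def update_weights(dists, w, observed, alpha, eps=2.0 ** -30):
    # One gradient step on -ln f_k(x_k, ...), followed by the componentwise max with eps.
    s = sum(w)                                    # w^T 1_m
    p = geometric_mixture(dists, w)               # p_x for every symbol x
    symbols = range(len(dists[0]))
    Q = [[math.log(d[x]) for d in dists] for x in symbols]               # Q(x)
    q = [sum(wi * qi for wi, qi in zip(w, Q[x])) / s for x in symbols]   # q_x
    new_w = []
    for i, wi in enumerate(w):
        observed_part = Q[observed][i] - q[observed]
        expected_part = sum(p[x] * (Q[x][i] - q[x]) for x in symbols)
        step = (observed_part - expected_part) / s
        new_w.append(max(eps, wi + alpha * step))
    return new_w

# Hypothetical example: model 0 predicts the observed symbol 0 much better than model 1.
dists = [[0.9, 0.1], [0.2, 0.8]]
w = [0.5, 0.5]
new_w = update_weights(dists, w, observed=0, alpha=0.1)
print(new_w, geometric_mixture(dists, new_w))
```

After the step the weight of the better model has grown, and the mixture assigns a higher probability to the symbol that was just observed.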

3.3 PAQ Mixtures or Geometric Mixtures for a Binary Alphabet 

Before we examine the details of "the" PAQ mixture method, we need to clarify that there
exist multiple PAQ mixture mechanisms [8]. We focus on the latest instance, which was
introduced in 2005 as a part of PAQ7. PAQ computes mixtures for a binary alphabet
and works with the probability of one-bits. The mixture is defined as follows

\[ f_k(1, P_1, P_2, \dots, P_m) := \operatorname{sq}\left( \sum_{i=1}^{m} w_i(k-1) \operatorname{st}(P_i(1 \mid x^{k-1})) \right), \tag{20} \]
\[ w_i(k) := w_i(k-1) + \alpha \left( x_k - f_k(1, P_1, P_2, \dots, P_m) \right) \operatorname{st}(P_i(1 \mid x^{k-1})), \tag{21} \]

where $x_k$ is the bit we observed in step $k$ and

\[ \operatorname{st}(x) := \ln \frac{x}{1 - x}, \qquad \operatorname{sq}(x) := \frac{1}{1 + e^{-x}}. \tag{22} \]



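Let $\mathbf{w}^T = (w_i)_{1 \le i \le m}$ be the weight vector in step $k$ where we assume that $\mathbf{w} \in \Omega_m$. Now
we rewrite (20) (due to (22)) to yield

\[ \begin{aligned} f_k(1, P_1, P_2, \dots, P_m) &= \left[ 1 + \exp\left( -\sum_{i=1}^{m} w_i \ln \frac{P_i(1 \mid x^{k-1})}{1 - P_i(1 \mid x^{k-1})} \right) \right]^{-1} \\ &= \left[ 1 + \prod_{i=1}^{m} \left( \frac{1 - P_i(1 \mid x^{k-1})}{P_i(1 \mid x^{k-1})} \right)^{w_i} \right]^{-1} \\ &= \frac{\prod_{i=1}^{m} P_i(1 \mid x^{k-1})^{w_i}}{\prod_{i=1}^{m} P_i(0 \mid x^{k-1})^{w_i} + \prod_{i=1}^{m} P_i(1 \mid x^{k-1})^{w_i}}, \end{aligned} \]

which matches (11). It is easy to check (via substituting (20) into (19)) that (21) is
an instance of iterative gradient descent, where $\alpha_k = \alpha$ is constant in any step and the
max-operation is omitted. When $\alpha$ is sufficiently small, the sequence $(\mathbf{w}(k))_{k \ge 1}$ converges
to some $\mathbf{w}_\alpha$ rather than the optimal solution $\mathbf{w}^*$. In turn, $\lim_{\alpha \to 0} \mathbf{w}_\alpha = \mathbf{w}^*$ [1]. A (small)
constant step size $\alpha$ thus needs to be determined experimentally.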
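
The equivalence of (20) and the binary geometric mixture, together with the update (21), can be checked numerically. The sketch below uses hypothetical predictions and weights; the function names are chosen here for illustration and are not identifiers from any PAQ source code.

```python
import math

def stretch(p):
    # st(x) = ln(x / (1 - x))
    return math.log(p / (1.0 - p))

def squash(x):
    # sq(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def paq_mix(p_one, w):
    # Equation (20): mix the stretched one-bit probabilities linearly, then squash.
    return squash(sum(wi * stretch(pi) for wi, pi in zip(w, p_one)))

def geometric_mix_binary(p_one, w):
    # Geometric mixture for a binary alphabet with exponents w_i.
    ones = math.prod(pi ** wi for pi, wi in zip(p_one, w))
    zeros = math.prod((1.0 - pi) ** wi for pi, wi in zip(p_one, w))
    return ones / (ones + zeros)

def paq_update(p_one, w, bit, alpha):
    # Equation (21): constant step size, no max-operation.
    err = bit - paq_mix(p_one, w)
    return [wi + alpha * err * stretch(pi) for wi, pi in zip(w, p_one)]

# Hypothetical one-bit predictions of three models and a weight vector from the simplex.
p_one = [0.9, 0.2, 0.6]
w = [0.5, 0.3, 0.2]
mixed = paq_mix(p_one, w)
new_w = paq_update(p_one, w, bit=1, alpha=0.02)
print(mixed, geometric_mix_binary(p_one, w), new_w)
```

Both mixing functions return the same probability up to rounding. After observing a one-bit, the weights of models that predicted a one with probability above 1/2 increase, while the weight of the model that favored a zero decreases.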

4 Linear Mixtures 



Let us return to the setting of Section 1.1. Instead of encoding $x^n$ with model $i$ and
transmitting our choice in $-\log W(i)$ bits, we will not do worse using the mixture distribution

\[ P(x^n) := \sum_{i=1}^{m} W(i) P_i(x^n). \]

Since we want to process sequentially we use the distribution (cf. (1))

\[ P(x \mid x^{k-1}) = \frac{P(x^{k-1}x)}{P(x^{k-1})} = \frac{\sum_{i=1}^{m} P_i(x^{k-1}x) W(i)}{P(x^{k-1})} = \sum_{i=1}^{m} \frac{W(i) P_i(x^{k-1})}{P(x^{k-1})} P_i(x \mid x^{k-1}) = \sum_{i=1}^{m} W(i \mid x^{k-1}) P_i(x \mid x^{k-1}) \tag{23} \]

in step $k$. There is an obvious interpretation for the mixture (23). Suppose that there are
$m$ sources and a probabilistic switching mechanism, which selects source $i$ with probability
$W(i \mid x^{k-1})$ in step $k$ (we interpret this as the posterior probability of $i$ given $x^{k-1}$).
When a source is selected, it appends a character $x$ (with probability $P_i(x \mid x^{k-1})$) to the
sequence $x^{k-1}$ to yield $x^k = x^{k-1}x$. We denote such a source as a switching source.

4.1 β-Weighting



We can modify the probability assignment of (23) to yield a linear mixture technique called
β-weighting, which has its roots in the CTW compression technique and was proposed
in [1]. β-weighting is defined by

\[ f_k(x, P_1, P_2, \dots, P_m) := \sum_{i=1}^{m} \beta_i(k-1) P_i(x \mid x^{k-1}), \qquad \beta_i(k-1) := W(i \mid x^{k-1}) = \frac{W(i) P_i(x^{k-1})}{P(x^{k-1})}. \]

After the character $x_k$ is known, we can compare $\beta_i(k)$ and $\beta_i(k-1)$ and observe that

\[ \beta_i(k) = \beta_i(k-1) \frac{P_i(x_k \mid x^{k-1})}{f_k(x_k, P_1, P_2, \dots, P_m)} \quad \text{and} \quad \beta_i(0) = W(i). \tag{24} \]

4.2 Generic Linear Weighting



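With the method of Lagrangian multipliers (see Section 3.1) we can show that (in step $k$)

\[ P := \arg\min_{Q} \sum_{i=1}^{m} w_i D(P_i \,\|\, Q), \quad \text{where } w_i \ge 0,\ 1 \le i \le m, \text{ and } \sum_{i=1}^{m} w_i > 0, \tag{25} \]

yields the linear mixture

\[ P(x \mid x^{k-1}) = f_k(x, P_1, P_2, \dots, P_m) := \sum_{i=1}^{m} w_i' P_i(x \mid x^{k-1}), \quad \text{where } w_i' := \frac{w_i}{\sum_{j=1}^{m} w_j}. \]
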

In the setting of the previous section the normalized weights $w_i'$ correspond to the switching
probabilities $W(i \mid x^{k-1})$. Thus, the cost function in (25) would be proportional to the
expected redundancy of a switching source in step $k$.

It is important to understand the difference between (3) and (25). In (3) $P_i$ plays the
role of a model distribution and we seek an approximate source distribution, which we
can use as a model distribution. On the other hand, in (25) $P_i$ plays the role of a source
distribution and we seek a model distribution, which matches our assumptions on the
specific source structure (namely, a switching source). We believe that the assumptions
of (3) are weaker (less restrictive) than those of (25), hence the geometric mixture is more general.

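In analogy to Section 3.2 we look for a weight vector $\mathbf{w}^*$, which minimizes the code
length of the sequence we want to compress, i.e.,

\[ \mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{k=1}^{n} -\log \frac{\sum_{i=1}^{m} w_i P_i(x_k \mid x^{k-1})}{\sum_{i=1}^{m} w_i}. \tag{26} \]

First we analyse the convexity properties of (26). W.l.o.g. we assume that $\mathbf{w}^T = (w_i)_{1 \le i \le m}$
is an element of $\Omega_m$. The convexity properties of (26) follow from the analysis of a single
term of the sum, which is proportional to

\[ l(\mathbf{w}) := -\ln\left( \mathbf{w}^T \mathbf{P}(x_k) \right) = \ln \frac{1}{\mathbf{w}^T \mathbf{P}(x_k)} \quad \text{for } \mathbf{w} \in \Omega_m, \quad \text{where } \mathbf{P}(x_k)^T := \left( P_i(x_k \mid x^{k-1}) \right)_{1 \le i \le m}. \]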



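The Hessian of $l(\mathbf{w})$ is positive definite, since

\[ \mathbf{v}^T \nabla^2 l(\mathbf{w}) \mathbf{v} = \mathbf{v}^T \frac{\mathbf{P}(x_k) \mathbf{P}(x_k)^T}{\left( \mathbf{w}^T \mathbf{P}(x_k) \right)^2} \mathbf{v} = \left( \frac{\mathbf{v}^T \mathbf{P}(x_k)}{\mathbf{w}^T \mathbf{P}(x_k)} \right)^2 > 0 \]

holds for $\mathbf{v} \ne 0$, $\mathbf{v} \in \mathbb{R}^m$. We conclude that the problem (26) is strictly convex. Thus,
there exists a single global minimizer $\mathbf{w}^* \in \Omega_m$. As in Section 3.2 we can obtain a weight
update rule via iterative gradient descent

\[ \mathbf{w}(k) := \max\left\{ \varepsilon \mathbf{1}_m,\ \mathbf{w}(k-1) + \alpha_k \frac{\mathbf{P}(x_k) - f_k(x_k, P_1, P_2, \dots, P_m) \cdot \mathbf{1}_m}{f_k(x_k, P_1, P_2, \dots, P_m)} \right\}, \tag{27} \]

where $\mathbf{w}(0) := 1/m \cdot \mathbf{1}_m$ and $\varepsilon$ is a small positive constant. It is interesting to note that
when we replace $\alpha_k$ with the matrix $\operatorname{diag}(\mathbf{w}(k-1))$ and omit the max-operation, (27)
turns into β-weighting (cf. (24)) and $\mathbf{w}(k) \in \Omega_m$, $k \ge 0$.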
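
The relation between the gradient update (27) and β-weighting (24) can be sketched as follows. This is an illustration under assumptions: the function names and example numbers are hypothetical, the weights are taken from the simplex so that the mixture value equals the inner product of weights and predictions, and the default `eps` is only a placeholder.

```python
def linear_mixture(probs, w):
    # f_k(x, ...) = sum_i w'_i P_i(x | x^{k-1}) with normalized weights w'_i.
    return sum(wi * pi for wi, pi in zip(w, probs)) / sum(w)

def gradient_update(probs_xk, w, alpha, eps=2.0 ** -30):
    # Equation (27): scalar step size alpha, componentwise max with eps.
    f = linear_mixture(probs_xk, w)
    return [max(eps, wi + alpha * (pi - f) / f) for wi, pi in zip(w, probs_xk)]

def beta_update(probs_xk, w):
    # Equation (27) with alpha replaced by diag(w(k-1)) and without the max-operation.
    f = linear_mixture(probs_xk, w)
    return [wi + wi * (pi - f) / f for wi, pi in zip(w, probs_xk)]

# Hypothetical probabilities P_i(x_k | x^{k-1}) that two models assigned to the observed symbol.
probs_xk = [0.9, 0.2]
w = [0.5, 0.5]
f = linear_mixture(probs_xk, w)

lin_w = gradient_update(probs_xk, w, alpha=1.0 / 32.0)
beta_w = beta_update(probs_xk, w)
print(lin_w, beta_w)
```

The β-weighting step multiplies each weight by the ratio of the model's probability to the mixture's probability, which is the recursion (24), and it keeps the weights on the simplex without an explicit normalization.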



5 Experiments 

In this section we compare the performance of a geometric mixture (GEO), a generic linear
mixture (LIN) and β-weighting (BETA) on the files of the well-known Calgary Corpus. We
have implemented the weighting techniques for a binary alphabet. To process non-binary
symbols (here, bytes) we employ an alphabet decomposition. Every symbol $x \in \mathcal{X}$
is processed in $N = \lceil \log |\mathcal{X}| \rceil$ intermediate steps, for details see, e.g., [S]. To ensure a
fair comparison, the set of models is the same for every mixture method: There are seven
finite-order context models (the probability estimations are conditioned on order-0 to
order-6 contexts). The eighth model is a match model. In step $k$ it searches for the longest
matching substring of length $L > 7$ in $x^{k-1}$. In the case of a match it predicts
the symbol (here, each bit in the intermediate steps) that succeeds the matching
substring with probability $1 - 1/L$, otherwise each symbol receives the probability $1/|\mathcal{X}|$.
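
The alphabet decomposition can be sketched as a sequence of binary decisions per byte. The following is only an illustration: the most-significant-bit-first order, the interface `predict_one(prefix_bits)` and all names are assumptions made here, since the text refers to the literature for the details.

```python
import math

def byte_to_bits(symbol, n_bits=8):
    # Binary decomposition of a symbol, most significant bit first (an assumed order).
    return [(symbol >> (n_bits - 1 - j)) & 1 for j in range(n_bits)]

def symbol_code_length(symbol, predict_one, n_bits=8):
    # predict_one(prefix_bits) is a hypothetical callback that returns the mixed
    # probability of a one-bit, given the bits of the current symbol seen so far.
    bits = byte_to_bits(symbol, n_bits)
    total_bits = 0.0
    for j, bit in enumerate(bits):
        p_one = predict_one(tuple(bits[:j]))
        total_bits += -math.log2(p_one if bit else 1.0 - p_one)
    return total_bits

# Number of intermediate steps for a byte alphabet: N = ceil(log2 |X|).
n_steps = math.ceil(math.log2(256))

# A predictor without any knowledge assigns 1/2 to every bit.
uniform_bits = symbol_code_length(0x41, lambda prefix: 0.5)
print(n_steps, byte_to_bits(0x41), uniform_bits)
```

The probability of a whole byte is the product of the probabilities of its binary decisions, so a binary mixture applied in each intermediate step induces a distribution over all 256 symbols.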

For each mixture technique we select a weight vector $\mathbf{w}$ based on an order-1 context
and on the match length $L$ (determined by the match model in every step $k$). Initially any
weight vector is initialized to $1/m \cdot \mathbf{1}_m$. After a weight update we ensure that $\mathbf{w} \ge \varepsilon \cdot \mathbf{1}_m$ (we
set $\varepsilon = 2^{-30}$) and $\mathbf{w}^T \mathbf{1}_m = 1$. For β-weighting we can confirm the observation made in [1]:
The weights must be bounded considerably away from zero, i.e., $\beta_i \ge \varepsilon$ (we set e = 2~^). A
weight update based on iterative gradient descent requires a step size $\alpha_k$. We set $\alpha_k = 1/16$
(GEO) or $\alpha_k = 1/32$ (LIN), respectively. The step size (for GEO and LIN) and $\varepsilon$ (for BETA)
were determined experimentally for maximum compression. We did not notice significant
changes in compression, when the step size was sufficiently small (in the scale of 10"^). 

Table 1 summarizes our experimental results. GEO outperforms LIN and BETA in almost
every case, except for the file obj1, where the compression is roughly 2% worse than LIN
and BETA. On average LIN compresses about 2% and BETA compresses about 3.6% worse
than GEO, respectively. When we compare LIN and BETA we see that BETA produces worse
compression in every case, 1.5% on average. Summarizing we may say that GEO works
better than LIN. In our experiments BETA is inferior to the other weighting techniques.
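
The relative differences quoted above follow from the averages and the obj1 row of Table 1. The short sketch below recomputes them; the helper name `percent_worse` is chosen here for illustration.

```python
def percent_worse(rate, reference):
    # Relative increase of a compression rate (in bpc) over a reference rate, in percent.
    return 100.0 * (rate / reference - 1.0)

# Averages and the obj1 row as listed in Table 1 (bpc).
average = {"GEO": 2.187, "LIN": 2.231, "BETA": 2.265}
obj1 = {"GEO": 3.672, "LIN": 3.603, "BETA": 3.610}

lin_vs_geo = percent_worse(average["LIN"], average["GEO"])
beta_vs_geo = percent_worse(average["BETA"], average["GEO"])
beta_vs_lin = percent_worse(average["BETA"], average["LIN"])
geo_vs_lin_obj1 = percent_worse(obj1["GEO"], obj1["LIN"])

print(f"LIN vs GEO: {lin_vs_geo:.1f}%, BETA vs GEO: {beta_vs_geo:.1f}%, "
      f"BETA vs LIN: {beta_vs_lin:.1f}%, GEO vs LIN on obj1: {geo_vs_lin_obj1:.1f}%")
```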

6 Conclusion 

In this paper we introduced geometric weighting as a new technique for computing 
mixtures in statistical data compression. In addition we introduced a new generic linear 
weighting strategy. We explain which assumptions the weighting techniques are based 
on. Furthermore, our results reveal that PAQ is an instance of geometric weighting for 
a binary alphabet. All of the presented mixture techniques rely on weight vectors. It turns
out that in both cases the weight estimation is a good-natured problem since
it is strictly convex. An experimental study indicates that geometric weighting is superior 
to linear weighting (for a binary alphabet). 

For future research it would be interesting to obtain statements about the situations where
geometric weighting outperforms linear weighting (and vice versa). Another topic is how to
select a fixed number of submodels for maximum compression. This leads to the optimization
of model and mixture parameters (and to the question of whether or not the optimization
problem remains convex). Such a question is very natural, since we wish to maximize
the compression with limited resources (CPU and RAM). Combining multiple models in 
data compression is highly successful in practice, but more research in this area is needed. 

Table 1: Compression rates in bpc on the Calgary Corpus for geometric (GEO), generic
linear (LIN) and β-weighting (BETA); best results are typeset boldface.

File      GEO      LIN      BETA
bib       1.816    1.89(J   1.907
book1              Z.OU4    z.oLo
book2     1.864    1.943    1.965
geo       4.407    4.423    4.501
news      2.286    2.347    2.412
obj1      3.672    3.603    3.610
obj2      2.224    2.240    2.298
paper1    2.274    2.327    2.343
paper2    2.220    2.288    2.310
pic       0.813    0.871    0.922
progc     2.276    2.327    2.361
progl     1.558    1.607    1.651
progp     1.610    1.638    1.669
trans     1.384    1.430    1.453
Average   2.187    2.231    2.265